Statistical inference, part III

Point and interval estimates

Eva Freyhult

NBIS, SciLifeLab

2022-09-13

Point and interval estimates

The sample proportion and sample mean are unbiased estimates of the population proportion and population mean.

The sample estimate is our best guess, but it will not be without error.

Point and interval estimates

Pollen example

If we are interested in what proportion of the Uppsala population is allergic to pollen, we can investigate this by studying a random sample. We randomly select 100 persons in Uppsala and observe that 42 have a pollen allergy.

Based on this observation, our point estimate of the Uppsala population proportion \(\pi\) is \(\pi \approx p = 0.42\).

We know that there is some uncertainty in this measurement. If the experiment were repeated, we would select 100 other persons and our point estimate would be slightly different.
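To illustrate how the point estimate varies between samples, here is a minimal simulation sketch. It assumes a hypothetical true proportion `pi_true = 0.42`; in practice \(\pi\) is unknown, and the value is chosen only for illustration.

```r
set.seed(1)
pi_true <- 0.42  # hypothetical population proportion, assumed for illustration
n <- 100         # sample size, as in the pollen example

# Repeat the experiment 5 times: each draw is the number allergic among n persons
p_hat <- rbinom(5, size = n, prob = pi_true) / n
p_hat
```

Each element of `p_hat` is the point estimate from one simulated sample of 100 persons.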

Bootstrap interval

Using bootstrap we can sample with replacement from our sample to estimate the uncertainty.

The bootstrap uses the data we have (our sample) and samples repeatedly, with replacement, from this data.

Put the entire sample in an urn!

Sample from the urn with replacement to compute the bootstrap distribution.

Bootstrap interval

Sample a ball with replacement 100 times and note the proportion of allergic persons (black balls).

Repeat this many times to get a bootstrap distribution.

Using the bootstrap distribution the uncertainty of our estimate of \(\pi\) can be estimated.

The 95% bootstrap interval is [0.32, 0.52].
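The urn procedure can be sketched in R as follows. This is one possible implementation, assuming a percentile interval (the middle 95% of the bootstrap distribution) and an arbitrary choice of 10000 resamples; the names `urn`, `B` and `boot_p` are introduced here for illustration.

```r
set.seed(1)
# The observed sample: 42 allergic (1) and 58 non-allergic (0) persons
urn <- c(rep(1, 42), rep(0, 58))
B <- 10000  # number of bootstrap resamples (an arbitrary choice)

# Each resample draws 100 balls with replacement and records the proportion allergic
boot_p <- replicate(B, mean(sample(urn, size = length(urn), replace = TRUE)))

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap distribution
ci_boot <- quantile(boot_p, probs = c(0.025, 0.975))
ci_boot
```

The exact endpoints vary slightly with the random seed and the number of resamples.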

The bootstrap is very useful if we do not know the distribution of our sampled property. But in our example we actually do.

Confidence interval

A confidence interval is a type of interval estimate associated with a confidence level.

An interval that covers the population parameter \(\theta\) with probability \(1 - \alpha\) is called a confidence interval for \(\theta\) with confidence level \(1 - \alpha\).

Confidence interval of proportions

Remember that we can use the central limit theorem to show that

\[P \sim N\left(\pi, SE\right) \iff P \sim N\left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)\]

It follows that

\[Z = \frac{P - \pi}{SE} \sim N(0,1)\]

Based on what we know of the standard normal distribution, we can compute an interval around the population property \(\pi\) such that the probability that a sample property \(p\) falls within this interval is \(1-\alpha\).
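The normal approximation of \(Z\) can be checked by simulation. The sketch below assumes a hypothetical population proportion `pi_true = 0.42` and sample size `n = 100`; both are illustrative choices.

```r
set.seed(1)
pi_true <- 0.42  # hypothetical population proportion (assumption)
n <- 100
se <- sqrt(pi_true * (1 - pi_true) / n)

# 10000 simulated sample proportions and their standardized values
p_sim <- rbinom(10000, size = n, prob = pi_true) / n
z_sim <- (p_sim - pi_true) / se

# Mean and standard deviation of Z should be close to 0 and 1
c(mean = mean(z_sim), sd = sd(z_sim))

# Fraction of standardized values within +/- 1.96, expected to be near 0.95
frac_within <- mean(abs(z_sim) < 1.96)
frac_within
```

Because the number of allergic persons is discrete, the fraction will not be exactly 0.95.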

Confidence interval

\[P\left(-z_{\alpha/2} < Z <z_{\alpha/2}\right) = 1-\alpha\]

\(z_{\alpha/2}\) is the value such that \(P(Z \geq z_{\alpha/2}) = \frac{\alpha}{2} \iff P(Z \leq z_{\alpha/2}) = 1 - \frac{\alpha}{2}\).

For 95% confidence, \(\alpha = 0.05\) and \(z_{\alpha/2} = z_{0.025} = 1.96\). For 90% and 99% confidence, \(z_{0.05} = 1.64\) and \(z_{0.005}=2.58\), respectively.
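These critical values can be computed with the standard normal quantile function `qnorm` in R. A minimal sketch, where the variable names are chosen here for illustration:

```r
conf_level <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_level

# z_{alpha/2} is the (1 - alpha/2) quantile of the standard normal distribution
z <- qnorm(1 - alpha / 2)
round(z, 2)
```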

Confidence interval of proportion

\[\begin{aligned} P\left(-z_{\alpha/2} < Z < z_{\alpha/2}\right) &= 1-\alpha \\ P\left(-z_{\alpha/2} < \frac{P - \pi}{SE} < z_{\alpha/2}\right) &= 1 - \alpha \end{aligned}\]

We can rewrite this to

\[P\left(\pi-z_{\alpha/2} SE < P < \pi + z_{\alpha/2} SE\right) = 1-\alpha\]

In words, a sample fraction \(p\) will fall within \(\pi \pm z_{\alpha/2} SE\) with probability \(1- \alpha\).

Rearranging the inequalities to isolate \(\pi\) instead of \(P\), the equation can also be written as

\[P\left(P-z_{\alpha/2} SE < \pi < P + z_{\alpha/2} SE\right) = 1 - \alpha\]

Confidence interval of proportion

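The observed confidence interval is what we get when we replace the random variable \(P\) with our observed fraction \(p\), writing \(z\) for \(z_{\alpha/2}\),

\[p-z SE < \pi < p + z SE\]

Since \(SE\) depends on the unknown \(\pi\), it is estimated by plugging in \(p\),

\[\pi = p \pm z SE = p \pm z \sqrt{\frac{p(1-p)}{n}}\]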

Confidence interval of proportion

The 95% confidence interval is

\[\pi = p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}\]

Confidence interval of proportion

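A 95% confidence interval has a 95% chance of covering the true value: if the sampling were repeated many times, about 95% of the resulting intervals would contain \(\pi\).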
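The coverage can be illustrated by simulation. The sketch below assumes a hypothetical true proportion `pi_true = 0.42` and repeats the sampling 10000 times, computing the interval \(p \pm z \sqrt{p(1-p)/n}\) for each simulated sample.

```r
set.seed(1)
pi_true <- 0.42  # hypothetical population proportion (assumption)
n <- 100
z <- qnorm(0.975)

# Simulated sample proportions and their estimated standard errors
p_sim <- rbinom(10000, size = n, prob = pi_true) / n
se_sim <- sqrt(p_sim * (1 - p_sim) / n)

# TRUE when the interval from a simulated sample covers pi_true
covered <- (p_sim - z * se_sim < pi_true) & (pi_true < p_sim + z * se_sim)
coverage <- mean(covered)
coverage
```

The simulated coverage is only approximately 95%, since the interval relies on the normal approximation.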

Confidence interval of proportion

Back to our example of the proportion of pollen allergic persons in Uppsala: \(p=0.42\) and \(SE=\sqrt{\frac{p(1-p)}{n}} = 0.0494\).

Hence, the 95% confidence interval is \[\pi = 0.42 \pm 1.96 \cdot 0.0494 = 0.42 \pm 0.097\] or \[(0.42-0.097, 0.42+0.097) = (0.32, 0.52)\]
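The same calculation can be sketched in R, using the numbers from the pollen example. The variable names are chosen here for illustration.

```r
n <- 100  # sample size
x <- 42   # number of allergic persons in the sample
p <- x / n

# Estimated standard error and the 95% critical value
se <- sqrt(p * (1 - p) / n)
z <- qnorm(0.975)

# Lower and upper limits of the 95% confidence interval
ci <- c(lower = p - z * se, upper = p + z * se)
round(se, 4)
round(ci, 2)
```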

Confidence interval of mean

The mean of a sample of \(n\) independent and identically normally distributed observations \(X_i\) is itself normally distributed,

\[\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]

If \(\sigma\) is unknown and estimated by the sample standard deviation \(s\), the statistic

\[\frac{\bar X - \mu}{SEM} = \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} \sim t(n-1)\]

is t-distributed with \(n-1\) degrees of freedom.

It follows that

\[ \begin{aligned} P\left(-t < \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}} < t\right) = 1 - \alpha \iff \\ P\left(\bar X - t \frac{s}{\sqrt{n}} < \mu < \bar X + t \frac{s}{\sqrt{n}}\right) = 1 - \alpha \end{aligned} \]
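A small simulation can illustrate why the t-distribution is used when \(\sigma\) is estimated from a small sample. The sketch assumes hypothetical values `mu = 10`, `sigma = 2` and `n = 5`, and compares the coverage obtained with a critical value from the t-distribution to that obtained with one from the standard normal distribution; the helper `covers` is introduced here for illustration.

```r
set.seed(1)
mu <- 10     # hypothetical population mean (assumption)
sigma <- 2   # hypothetical population standard deviation (assumption)
n <- 5

# For a given critical value, check in 10000 simulated samples
# whether the interval mean(x) +/- crit * s / sqrt(n) covers mu
covers <- function(crit) {
  replicate(10000, {
    x <- rnorm(n, mean = mu, sd = sigma)
    abs(mean(x) - mu) < crit * sd(x) / sqrt(n)
  })
}

cov_t <- mean(covers(qt(0.975, df = n - 1)))  # critical value from t(n-1)
cov_z <- mean(covers(qnorm(0.975)))           # critical value from N(0,1)
c(t = cov_t, z = cov_z)
```

With so few observations, the normal critical value gives intervals that are too narrow, so the coverage falls clearly below 95%.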

Confidence interval of mean

The confidence interval with confidence level \(1-\alpha\) is thus

\[\mu = \bar x \pm t \frac{s}{\sqrt{n}}\]

For a 95% confidence interval and \(n=5\) (4 degrees of freedom), \(t = 2.7764\).

The \(t\) values for different values of \(\alpha\) and degrees of freedom are tabulated and can be computed in R using the function `qt`.

n <- 5
alpha <- 0.05
## t value for a two-sided interval with confidence level 1 - alpha
qt(1 - alpha/2, df = n - 1)
[1] 2.776
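Putting the pieces together, a confidence interval of the mean can be computed as in the sketch below. The data vector `x` is made up for illustration and is not part of the original example; the result is cross-checked against the interval returned by R's built-in `t.test`.

```r
# Hypothetical sample of n = 5 measurements (made-up values for illustration)
x <- c(4.1, 5.3, 4.8, 5.9, 5.0)
n <- length(x)
alpha <- 0.05

# Standard error of the mean and the t critical value with n - 1 degrees of freedom
sem <- sd(x) / sqrt(n)
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Confidence interval: sample mean +/- t * s / sqrt(n)
ci <- mean(x) + c(-1, 1) * t_crit * sem
ci

# Cross-check against the interval from a one-sample t-test
ci_ttest <- as.numeric(t.test(x, conf.level = 1 - alpha)$conf.int)
ci_ttest
```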